skip to main content


Search for: All records

Creators/Authors contains: "Bader, David A."

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. The suffix array is a crucial data structure for efficient string analysis. Over the course of twenty-six years, sequential suffix array construction algorithms have achieved O(n) time complexity and in-place sorting. In this paper, we present the Tunnel algorithm, the first large-scale parallel suffix array construction algorithm with a time complexity of O(n/p) based on the parallel random access machine (PRAM) model. The Tunnel algorithm is built on three key ideas: dividing the problem of size O(n) into p sub-problems of reduced size O(n/p) by replacing long suffixes with shorter prefixes of size at most a constant D ; introducing a Tunnel mechanism to efficiently induce the order of a set of suffixes with long common prefixes; developing a strategy to transform a partially ordered suffix set into a total order relation by iteratively applying the Tunnel inducing method. We provide a detailed description of the algorithm, along with a thorough analysis of its time and space complexity, to demonstrate its correctness and efficiency. The proposed Tunnel algorithm exhibits scalable performance, making it suitable for large string analytics on large-scale parallel systems. 
    more » « less
    Free, publicly-accessible full text available December 1, 2024
  2. This paper presents an algorithm for detecting attributed high-degree node isomorphism. High-degree isomorphic nodes seldom happen by chance and often represent duplicated entities or data processing errors. By definition, isomorphic nodes are topologically indistinguishable and can be problematic in graph ML tasks. The algorithm employs a parallel, “degree-bounded” approach that fingerprints each node’s local properties through a hash, which constrains the search to nodes within hash-defined buckets, thus minimising the number of comparisons. This method scales on graphs with billions of nodes and edges. Finally, we provide isomorphic node oddities identified in real-world data. 
    more » « less
    Free, publicly-accessible full text available May 1, 2024
  3. Given an undirected, weighted graph, the minimum spanning tree (MST)is a tree that connects all of the vertices of the graph with minimum sum of edge weights. In real world applications, network designers often seek to quickly find a replacement edge for each edge in the MST. For example, when a traffic accident closes a road in a transportation network, or a line goes down in a communication network, the replacement edge may reconnect the MST at lowest cost. In the paper, we consider the case of finding the lowest cost replacement edge for each edge of the MST. A previous algorithm by Tarjan takes O{m lpha(m, n)} time and space, where $lpha(m, n)$ is the inverse Ackermann’s function. Given the MST and sorted non-tree edges, our algorithm is the first practical algorithm that runs in O(m+n) time and O(m+n) space to find all replacement edges. Additionally, since the most vital edge is the tree edge whose removal causes the highest cost, our algorithm finds it in linear time. 
    more » « less
  4. etecting valuable anomalies with high accuracy and low latency from large amounts of streaming data is a challenge. This article focuses on a special kind of stream, the catalog stream, which has a high-level structure to analyze the stream effectively. We first formulate the anomaly detection in catalog streams as a constrained optimization problem based on a catalog stream matrix. Then, a novel filtering-identifying based anomaly detection algorithm (FIAD) is proposed, which includes two complementary strategies, true event identifying and false alarm filtering. Different kinds of attention windows are developed to provide corresponding data for various algorithm components. The identifying strategy includes true events in a much smaller candidate set. Meanwhile, the filtering strategy significantly removes false positives. A scalable catalog stream processing framework CSPF is designed to support the proposed method efficiently. Extensive experiments are conducted on the catalog stream data sets from an astronomy observation. The experimental results show that the proposed method can achieve a false-positive rate as low as 0.04%, reduces the false alarms by 98.6% compared with the existing methods, and the latency to handle each catalog is 2.1 seconds. Furthermore, a total of 36 transient candidates are detected from one observation season. 
    more » « less
  5. Identifying key members in large social network graphs is an important graph analytic. Recently, a new centrality measure called triangle centrality finds members based on the triangle support of vertices in graph. In this paper, we describe our rapid implementation of triangle centrality using Graph-BLAS, an API specification for describing graph algorithms in the language of linear algebra. We use triangle centrality’s algebraic algorithm and easily implement it using the SuiteSparse GraphBLAS library. A set of experiments on large, sparse graph datasets is conducted to verify the implementation. 
    more » « less
  6. Exploratory graph analytics helps maximize the informational value from a graph. However, the increasing graph size makes it impossible for existing popular exploratory data analysis tools to handle dozens-of-terabytes or even larger data sets in the memory of a common laptop/personal computer. Arkouda is a framework under early-development that brings together the productivity of Python at the user side with the high-performance of Chapel at the server side. In this paper, we present preliminary work on overcoming the memory limit and high performance computing coding roadblock for high level Python users to perform large graph analysis. A simple and succinct graph data structure design and implementation at both the Python front-end and the Chapel back-end in the Arkouda framework are provided. A typical graph algorithm, Breadth-First Search (BFS), is used to show how we can use Chapel to develop high performance parallel graph algorithm productively. Two Chapel-based parallel Breadth-First Search (BFS) algorithms, one high level version and one corresponding low level version, have been implemented in Arkouda to support analyzing large graphs. Multiple graph benchmarks are used to evaluate the performance of the provided graph algorithms. Ex- perimental results show that we can optimize the performance by tuning the selection of different Chapel high level data structures and parallel constructs. Our code is open source and available from GitHub (https://github.com/Bader-Research/arkouda). 
    more » « less
  7. Large scale data sets from the web, social networks, and bioinformatics are widely available and can often be represented by strings and suffix arrays are highly efficient data structures enabling string analysis. But, our personal devices and corresponding exploratory data analysis (EDA) tools cannot handle big data sets beyond the local memory. Arkouda is a framework under early development that brings together the productivity of Python at the user side with the high-performance of Chapel at the server-side. In this paper, an efficient suffix array data structure design and integration method are given first. A suffix array algorithm library integration method instead of one single suffix algorithm is presented to enable runtime performance optimization in Arkouda since different suffix array algorithms may have very different practical performances for strings in various applications. A parallel suffix array construction algorithm framework is given to further exploit hierarchical parallelism on multiple locales in Chapel. A corresponding benchmark is developed to evaluate the feasibility of the provided suffix array integration method and measure the end-to-end performance. Experimental results show that the proposed solution can provide data scientists an easy and efficient method to build suffix arrays with high performance in Python. All our codes are open source and available from GitHub (https://github.com/Bader-Research/arkouda/tree/string-suffix-array-functionality). 
    more » « less
  8. The transitive closure of a graph is a new graph where every vertex is directly connected to all vertices to which it had a path in the original graph. Transitive closures are useful for reachability and relationship querying. Finding the transitive closure can be computationally expensive and requires a large memory footprint as the output is typically larger than the input. Some of the original research on transitive closures assumed that graphs were dense and used dense adjacency matrices. We have since learned that many real-world networks are extremely sparse, and the existing methods do not scale. In this work, we introduce a new algorithm called Anti-section Transitive Closure (ATC) for finding the transitive closure of a graph. We present a new parallel edges operation – anti-sections – for finding new edges to reachable vertices. ATC scales to massively multithreaded systems such as NVIDIA’s GPU with tens of thousands of threads. We show that the anti-section operation shares some traits with the triangle counting intersection operation in graph analysis. Lastly, we view the transitive closure problem as a dynamic graph problem requiring edge insertions. By doing this, our memory footprint is smaller. We also show a method for creating the batches in parallel using two different techniques: dual-round and hash. Using these techniques and the Hornet dynamic graph data structure, we show our new algorithm on an NVIDIA Titan V GPU. We compare with other packages such as NetworkX, SEI-GBTL, SuiteSparse, and cuSparse. 
    more » « less
  9. Identifying anomalies, especially weak anomalies in constantly changing targets, is more difficult than in stable targets. In this article, we borrow the dynamics metrics and propose the concept of dynamics signature (DS) in multi-dimensional feature space to efficiently distinguish the abnormal event from the normal behaviors of a variable star. The corresponding dynamics criterion is proposed to check whether a star's current state is an anomaly. Based on the proposed concept of DS, we develop a highly optimized DS algorithm that can automatically detect anomalies from millions of stars' high cadence sky survey data in real-time. Microlensing, which is a typical anomaly in astronomical observation, is used to evaluate the proposed DS algorithm. Two datasets, parameterized sinusoidal dataset containing 262,440 light curves and real variable stars based dataset containing 462,996 light curves are used to evaluate the practical performance of the proposed DS algorithm. Experimental results show that our DS algorithm is highly accurate, sensitive to detecting weak microlensing events at very early stages, and fast enough to process 176,000 stars in less than 1 s on a commodity computer. 
    more » « less
  10. Data from emerging applications, such as cybersecurity and social networking, can be abstracted as graphs whose edges are updated sequentially in the form of a stream. The challenging problem of interactive graph stream analytics is the quick response of the queries on terabyte and beyond graph stream data from end users. In this paper, a succinct and efficient double index data structure is designed to build the sketch of a graph stream to meet general queries. A single pass stream model, which includes general sketch building, distributed sketch based analysis algorithms and regression based approximation solution generation, is developed, and a typical graph algorithm—triangle counting—is implemented to evaluate the proposed method. Experimental results on power law and normal distribution graph streams show that our method can generate accurate results (mean relative error less than 4%) with a high performance. All our methods and code have been implemented in an open source framework, Arkouda, and are available from our GitHub repository, Bader-Research. This work provides the large and rapidly growing Python community with a powerful way to handle terabyte and beyond graph stream data using their laptops. 
    more » « less